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Abstract 

This paper proposes a new randomized strategy for adaptive MCMC using Bayesian optimization. 
This approach apphes to non-difTerentiable objective functions and trades off exploration and exploita- 
tion to reduce the number of potentially costly objective function evaluations. We demonstrate the 
strategy in the complex setting of sampling from constrained, discrete and densely connected proba- 
bilistic graphical models where, for each variation of the problem, one needs to adjust the parameters 
of the proposal mechanism automatically to ensure efficient mixing of the Markov chains. 

1 Introduction 

A common line of attack for solving problems in physics, statistics and machine learning is to draw 
samples from probabihty distributions 7r(-) that are only known up to a normalizing constant. Markov 
chain Monte Carlo (MCMC) algorithms are often the preferred method for accomplishing this sampling 
task, see e.g. Andrieu et al.| (2003) and Robert & Casella (1998). Unfortunately, these algorithms 



typically have parameters that must be tuned in each new situation to obtain reasonable mixing times. 
These parameters are often tuned by a domain expert in a time-consuming and error-prone manual 
process. Adaptive MCMC methods have been developed to automatically adjust the parameters of 
MCMC algorithms. We refer the reader to three recent and excellent comprehensive reviews of the field 



(Andrieu & Thoms 2008 Atchade et al. 2009 Roberts & Rosenthal 2009). 



Adaptive MCMC methods based on stochastic approximation have garnered the most interest out of 
the various adaptive MCMC methods for two reasons. Firstly, they can be shown to be theoretically 
valid. That is, the Markov chain is made inhomogenous by the dependence of the parameter updates 



upon the history of the Markov chain, but its ergodicity can be ensured (Andrieu & Robert 2001 



Andrieu & Moulines 20061 Saksman & Vihola 2010). For example, Theorem 5 of Roberts & Rosenthal 



(20071 establishes two simple conditions to ensure ergodicity: (i) the non-adaptive sampler has to be 



uniformly ergodic and (ii) the level of adaptation must vanish asymptotically. These conditions can be 
easily satisfied for discrete state spaces and finite adaptation. 

Secondly, adaptive MCMC algorithms based on stochastic approximation have been shown to work 
well in practice ( Haario et al.[ 2001 Roberts & Rosenthal 2009 Vihola[ 2010). However, there are 



some limitations to the stochastic approximation approach. Some of the most successful samplers rely 
on knowing either the optimal acceptance rate or the gradient of some objective function of interest. 
Another disadvantage is that these stochastic approximation methods may require many iterations. This 
is particularly problematic when the objective function being optimized by the adaptation mechanism is 
costly to evaluate. Finally, gradient approaches tend to be local and hence they can get trapped in local 
optima when the Markov chains are run for a finite number of steps in practice. 

This paper aims to overcome some of these limitations. It proposes the use of Bayesian optimization 
(iBrochu et all 12009) 



to tune the parameters of the Markov chain. The proposed approach, Bayesian- 
optimized MCMC, has a few advantages over adaptive methods based on stochastic approximation. 

Bayesian optimization does not require that the objective function be differentiable. This enables 
us to be much more flexible in the design of the adaptation mechanisms. We use the area under the 
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auto-correlation function up to a specific lag as the objective function in this paper. This objective has 
been suggested previously by Andrieu & Robert (2001). However, the computation of gradient estimates 
for this objective is very involved and far from trivial (Andrieu & Robert 2001 ). We believe this is one of 
the main reasons why practitioners have not embraced this approach. Here, we show that this objective 
can be easily optimized with Bayesian optimization. We argue that Bayesian optimization endows the 
designer with greater freedom in the design of adaptive strategies. 

Bayesian optimization also has the advantage that it is explicitly designed to trade off exploration and 
exploitation and is implicitly designed to minimize the number of expensive evaluations of the objective 



function (Brochu et al. 2009 Lizotte et al. 2011). 



Another important property of Bayesian-optimized MCMC is that it does not use a specific setting 
for the parameters of the proposal distribution, but rather a distribution over parameter settings with 
probabilities estimated during the adaptation process. Indeed, we find that a randomized policy over a 
set of parameter settings mixes faster than a specific parameter value for the models considered in this 
paper. 

Bayesian optimization has been used with MCMC in Rasmussen (2003) with the intent of approxi- 
mating the posterior with a surrogate function to minimize the cost of hybrid Monte Carlo evaluations. 
The intent in this paper is instead to adapt the parameters of the Markov chain to improve mixing. 

To demonstrate MCMC adaptation with Bayesian optimization, we study the problem of adapting 
a sampler for constrained discrete state spaces proposed recently by Hamze & de Freitas (2010). The 
sampler uses augmentation of the state space in order to make large moves in discrete state space. In 
this sense, the sampler is similar to the Hamiltonian (hybrid) Monte Carlo for continuous state spaces 
(Duane et al. , 1987 Neal, 2010). Although these samplers typically only have two parameters, these are 
very tricky to tune even by experts. Moreover, every time we make a change to the model, the expert is 
again required to spend time tuning the parameters. This is often the case when dealing with sampling 
from discrete probabilistic graphical models, where we often want to vary the topology of the graph as 
part of the data analysis and experimentation process. In the experimental section of this paper, we 
will discuss several Ising models, also known as Boltzmann machines, and show that indeed the optimal 
parameters vary significantly for different model topologies. 



There are existing consistency results for Bayesian optimization ( Vasquez & Beet 2008 Srinivas et al 



2010 Bull 2011 ). These results in combination with the fact that we focus on discrete distributions are 
sufficient for ensuring ergodicity of our adaptive MCMC scheme under vanishing adaptation. However, 
the type of Bayesian optimization studied in this paper relies on latent Gaussian processes whose covari- 
ance grows as the Markov chain progresses. For very large chains, it becomes prohibitive to invert the 
covariance matrix of the Gaussian process. Thus, for practical reasons, we adopt instead a two-stage 
adaptation mechanism, where we first adapt the chain for a finite number of steps and then use the most 
promising parameter values to run the chain again with a mixture of MCMC kernels (Andrieu fc Thoms[ 



2008). Although this adaptation strategy increases the computational cost of the MCMC algorithm, we 



argue that this cost is much lower than the cost of having a human in the loop choosing the parameters. 



2 Adaptive MCMC 



The M etropolis-Hasting (MH) algorithm is the key building block for most MCMC methods (Andrieu 
2003). It draws samples from a target distribution 7r(-) by proposing a move from x*^*' to y'.'+^^ 



et al. 



according to a parameterized proposal distribution qeiy''*^^^ jx*^*') and either accepting it (x(*+^) ~ y*-*^^') 
with probability equal to the acceptance ratio 



7r(x(*))g0(y(*+i)|x(*)) 

or rejecting it (x^*^^) — x^*)) otherwise. 

The parameters of the proposal, e C M'', can have a large influence on sampling performance. For 
example, we will consider constrained discrete probabilistic models in our experiments, where changes 
to the connectivity patterns among the random variables will require different parameter settings. We 
would like to have an approach that can adjust these parameters automatically for all possible connectivity 
patterns. 



2 



Several methods have been proposed to adapt MCMC algorithms. Instead of discussing all of these, 



we refer the reader to the comprehensive reviews of Andrieu & Thoms (2008); Atchade et al. (20091; 



Roberts & Rosenthal (2009). One can adapt parameters other than those of the proposal distribution in 



certain situations, but for the sake of simplicity, we focus here on adapting the proposal distribution. 



One of the most successful adaptive MCMC algorithms was introduced by Haario et al. (2001) and 



several extensions were proposed by Andrieu & Thoms (2008). This algorithm is restricted to the adapta- 



tion of the multivariate random walk Metropolis algorithm with Gaussian proposals. It is motivated by a 



theoretical result regarding the optimal covariance matrix of a restrictive version of this sampler ( Gelman 
et al. 1996). This adaptive algorithm belongs to the family of stochastic approximation methods. 



Some notation needs to be introduced to briefly describe stochastic approximation, which will also 
be useful later when we replace the stochastic approximation method with Bayesian optimization. Let 
Xi = {x(*)}J^i denote the full set of samples up to iteration i of the MH algorithm and yi = {y^*^}t=i 
be the corresponding set of proposed samples. x*^°^ is the initial sample. We will group these samples 
into a single variable Aii = (x(°\ A'j,3^j). Let g{9) be the mean field of the stochastic approximation that 
may only be observed noisily as G{9i, Mi). When optimizing an objective function h{9), this mean field 
corresponds to the gradient of this objective, that is g{9) = Vh{9). Adaptive MCMC methods based on 
stochastic approximation typically use the following Robbins-Monro update: 



9^+l = 9,+-ii+iG{9i,Mi+i), 



(1) 



where 7^+1 is the step-size. This recursive estimate converges almost surely to the roots of g{9) as i — >■ 00 
under suitable conditions. 

We are concerned with the adaptation of discrete models in this paper. The optimal acceptance rates 
are unknown. It is also not clear what objective function one should optimize to adjust the parameters 
of the proposal distribution. One possible choice, as mentioned in the introduction, is to use the auto- 
correlation function up to a certain lag. Intuitively, this objective function seems to be suitable for the 
adaptation task because it is used in practice to assess convergence. Unfortunately, it is difficult to obtain 



efficient estimates of the gradient of this objective (Andrieu & Robert 2001 ). To overcome this difficulty. 



we introduce Bayesian optimization in the following section. 



3 Bayesian Optimization for Adaptive MCMC 

The proposed adaptive strategy consists of two phases: adaptation and sampling. In the adaptation 
phase Bayesian optimization is used to construct a randomized policy. In the sampling phase, a mixture 
of MCMC kernels selected according to the learned randomized policy is used to explore the target 
distribution. The two phases are discussed in more detail subsequently. 

3.1 Adaptation Phase 

Our objective function for adaptive MCMC cannot be evaluated analytically. However, noisy observations 
of its value can be obtained by running the Markov chain for a few steps with a specific choice of 
parameters 9i. Bayesian optimization in the adaptive MCMC setting proposes a new candidate 9i+i by 
approximating the unknown function using the entire history of noisy observations and a prior distribution 
over functions. The prior distribution used in this paper is a Gaussian process. 

The noisy observations are used to obtain the predictive distribution of the Gaussian process. An 
expected utility function derived in terms of the sufficient statistics of the predictive distribution is 
optimized to select the next parameter value 9i+i. The overall procedure is shown in Algorithm [Tj We 
refer readers to Brochu et al. ( 2009[ ) and Lizotte (2008) for in-depth reviews of Bayesian optimization. 



The unknown objective function h{-) is assumed to be distributed according to a Gaussian process 
with mean function m(-) and covariance function fc(-, •): 

h{-)^GP{m{-),k{;-)). 

We adopt a zero mean function m(-) — and an anisotropic Gaussian covariance that is essentially the 
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Algorithm 1 Adaptive MCMC with Bayesian Opt. 

1: for i = 1,2,. . . ,/ do 

2: Run Markov chain for L steps with parameters di. 

3: Use the drawn samples to obtain a noisy evaluation of the objective function: Zi = h{9i) + e. 
4: Augment the data "Di-i = {'Di i_\, {9i,Zi)}. 
5: Update the GP's sufficient statistics. 

6: Find by optimizing an acquisition function: Si+i = arg max^ 

7: end for 



popular ARD kernel (Rasmussen & Williams 2006): 



k{e,,dk) = exp ( - 0fc)^diag(^)-2(0^. _ Ok) 



where V' G ^ is a vector of hyper-parameters. The Gaussian process is a surrogate model for the true 
objective, which typically involves intractable expectations with respect to the invariant distribution and 
the MCMC transition kernels. We describe the objective function used in this work in Section 3.3. 

We assume that the noise in the measurements is Gaussian: Zi ~ h{9i) + e, e ^ A/'(0, cr,^). It is possible 



to adopt other noise models (Diggle et al. 1998). Our Gaussian process emulator has hyper-parameters 



ip and arj. These hyper-parameters are typically computed by maximizing the likelihood (Rasmussen & 
Williams 2006). In Bayesian optimization, we can use Latin hypercube designs to select an initial set of 



parameters and then proceed to maximize the likelihood of the hyper-parameters iteratively (Ye 1998 



Santner et al. [2003 ). This is the approach followed in our experiments. However, a good alternative is 



to use either classical or Bayesian quadrature to integrate out the hyper-parameters (Osborne 2010). 

Let Zi-i ^ A/'(0,K) be the i noisy observations of the objective function obtained from previous 
iterations. (Note that the Markov chain is run for L steps for each discrete iteration i. The extra index 
to indicate this fact has been made implicit to improve readability.) Zi.j and /i^+i are jointly multivariate 
Gaussian: 



K 



where 



K 



fc(0i,0i) 



and k = [k{9, 9i) . . . k{6, 9i)]'^ ■ All the above assumptions about the form of the prior distribution and 
observation model are standard and less restrictive than they might appear at first sight. The central 
assumption is that the objective function is smooth. For objective functions with discontinuities, we 
need more sophisticated surrogate functions for the cost. We refer readers to Gramacy et al. (2004) and 



Brochu et al. (2009j) for examples. 

The predictive distribution for any value 9 follows from the Sherman-Morrison- Woodbury formula, 
where = (9i.,i,Zi.,i): 



^,\V,.,,,9)=Af{fi,{9),<j^{9)) 



Zl: 



k{9,9)-W (K 



The next query point fl^+i is chosen to maximize an acquisition function, u{9\'Di.i), that trades- 
off exploration (where erf (0) is large) and exploitation (where Hi{9) is high). We adopt the expected 



improvement over the best candidate as this acquisition function Schonlau et al. (1998); Brochu et al. 



(20091. This is a standard acquisition function for which asymptotic rates of convergence have been 
proved (Bull 2011). However, we point out that there are a few other reasonable alternatives, such as 



Thompson sampling (May et al. 2011) and upper confidence bounds (UCB) on regret (Srinivas et al. 
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20101. A comparison among these options as well as portfolio strategies to combine them appeared 



recently in (Hoffman et al. 20111. There are several good ways of optimizing the acquisition function, 



including the method of Divi ded RECT angles (DIRECT) of ' Finkel| ( |2003[ ) and many versions of the 
projected Newton methods of Bertsekas (1982). We found DIRECT to provide a very efHcient solution 
in our domain. Note that optimizing the acquisition function is much easier than optimizing the original 
objective function. This is because the acquisition functions can be easily evaluated and differentiated. 



3.2 Sampling Phase 

The Bayesian optimization phase results in a Gaussian process on the / noisy observations of the perfor- 
mance criterion Zi-j, taken at the corresponding locations in parameter space 9i-j. This Gaussian process 
is used to construct a discrete stochastic policy p{6\zi-j) over the parameter space 0. The Markov chain 
is run with parameter settings randomly drawn from this policy at each step. 

One can synthesize the policy p{d\'z,i-j) in several ways. The simplest is to use the mean of the CP 
to construct a distribution proportional to exp(/i(6')). This is the so-called Boltzmann policy. We can 
sample M parameter candidates 9i according to this distribution. Our final sampler then consists of a 
mixture of M transition kernels, where each kernel is parameterized by one of the 9i, i — 1, . . . , M. The 
distribution of the samples generated in the sampling phase will approach the target distribution 7r(-) as 
the number of iterations tends to oo provided the kernels in this finite mixture are ergodic. 

In high dimensions, one reasonable approach would be to use a multi-start optimizer to find maxima 
of the unnormalized Boltzmann policy and then perform local exploration of the modes with a simple 
Metropolis algorithm. This is a slight more sophisticated version of what is often referred to as the epsilon 
greedy policy. 

The strategies discussed thus far do not take into account the uncertainty of the CP. A solution is 
to draw M functions according to the CP and then find he optimizer 9i of each of these functions. This 
is the strategy followed in (May et al. 2011) for the case of contextual bandits. Although this strategy 



works well for low dimensions, it is not clear how it can be easily scaled. 



3.3 Objective Function 

The auto-correlation r(/, X) of a sequence of n generated samples X = {x^-'^^ . . . , x^"^} as a function of 
the time lag between them is defined as 

n—l 

r(/,^)^^^5:(-^*)-xr(x(*^')_^), 

where I is the lag and x and 5^ are the mean and the variance of X, respectively. 

Faster mixing times are characterized by larger values of a(^,nax,'%') — ^ — (^max~^) X)!™!" 
We use this property to construct the criterion for Bayesian optimization as follows. Let £i be the last 
i sampled states (the energies in the experimental section). The performance criterion is obtained by 
taking the average of a(-) over a sliding window within the last L samples, down to a minimum window 
size of 25: ^_^5+i Ef=25 ^0- 



4 Application to Constrained Discrete Distributions 



The Intracluster Move (IM) sampler was recently proposed to generate samples from notoriously-hard 
constrained Boltzmann machines in (Hamze & de Freitas 2010). This sampler has two parameters (one 
continuous and the other discrete) that the authors state to be difficult to tune. This and the recent 
growing interest in discrete Boltzmann machines in machine learning motivated us to apply the proposed 
Bayesian-optimized MCMC method to this problem. 

Boltzmann machines are described in Ackley et al. (1985). Let Xi e {0, 1} denote the i-th random 



variable in the model. The Boltzmann distribution is given by 



7r(x) = 



1 



(2) 



5 



Table 1: Algorithm parameters and sets for each model following (Hamze & de Freitas 
MODEL ALGORITHM C Q 



2010) 



2DGrid 


IMExpert 


{90} 




{0.44} 


2DGrid 


Others 


{!,.. 


. , 300} 


[0,0.8? 


3DCube 


IMExpert 


{!,.. 


.,25} 


{0.8} 


BDCube 


Others 


{!,.. 


.,50} 


[0,1.6] 


RBM 


IMExpert 


{!,.. 


.,20} 


{0.8} 


RBM 


Others 


{!,.. 


.,50} 


[0,1.6] 



where Z(^) = Yl,^es ^ 



and b that are assumed to be known. 



/3_B(x) jg |.-|^g normalizing constant, /3 is a temperature parameter and E{x) = 
is the energy function. Boltzmann machines also have coupling parameters J 



Let Sn{c) be the subset of the states that are at exactly Hamming distance n away from a reference 
state c. The distribution 7r„_c(x) is the restriction of 7r(x) to 5„(c). 7r„^c(x) has Z„(/3, c) = - - . p-P^i^') 
as its normalizing constant and is defined as 



^c(x) = 



2„(,3,c)S 





-/3iS(x) 



if X e Sn{c) 

otherwise 



(3) 



The rest of the paper makes c implicit and uses the simplified notation 5„, 7r„(x) and Z„(/3). These 
constraints on the states are used in statistical physics and in regularized statistical models ( [Hamze "fc 



de Freitas 2010) 



The IM sampler proposes a new state y(*+i) £ 5„ from an original state x^*' G 5„ using self-avoiding 
walks (SAWs) and has parameters 9 = {k, 7), where k E C = {1, 2, . . . , fcmax} is the length of each SAW 
and J £ G — [0, 7max] is the energy-biasing parameter, k determines the size, in terms of the number of 
bits flipped, of the moves through 5„. 7 controls the degree to which higher energy states are favored. 



4.1 Experimental Setup 

The experiments compare the performance of four different sampling methods on three different models. 
The sampling methods are all instances of the IM sampler that only differ in the manner that 7 and k 
are picked: 

Kawasaki sampler : It transitions from state to state within 5„ by uniformly sampling a bit to flip 
to produce a state in Sn+i or Sn-i and then uniformly sampling a bit to flip to return to Sn 



(Kawasaki 19661. This is equivalent to running the IM sampler with 7 fixed to and k fixed to 1. 



IMExpert is the IM sampler manually tuned by a domain expert (Hamze & de Freitas, 2010). The 
expert set 7 to a fixed value 7oxpcrt and k is drawn uniformly in the set C. 

IMUnif is a completely naive approach that draws 7 uniformly from G and k uniformly from C. 

IMBayesOpt is the Bayesian-optimized IM with L = 100 samples generated for each of the 100 adap- 
tations of the parameters. 

The algorithm parameters for each model, defined subsequently, are shown in Table[l] where "Others" 
refers to IMUnif and IMBayesOpt. These two samplers have the same parameter sets because IMUnif is 
a baseline algorithm used to ensure that Bayesian-optimized IM performs better than a naive strategy. 
The parameters for IMUnif and IMBayesOpt were selected such that £ is a much larger superset of the 
SAW lengths used for IMExpert and G is the contiguous interval from to 27oxpcrt- The parameters for 



IMExpert come from (Hamze & de Freitas 2010). The Kawasaki sampler does not have any parameters. 
We consider the three models studied by Hamze & de Freitas (2010). The model parameters are given 



in Table [2j Note that n refers to the Hamming distance from states in Sn to the reference state c. The 
reference state c was set to the ground state where none of the bits are set to 1. This is particularly 
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Table 2: Model parameters from Hamze & de Freitas (20101. n refers to the number of bits that are set 
to 1 out of the total number of bits. /3~^ is the temperature. 

MODEL /3 1 



SIZE 



2DGrid 2.27 60 x 60 1800 of 3600 

SDCube 1.0 9 x 9 x 9 364 of 729 

RBM 1.0 \v\ = 784, \h\ = 500 428 of 1284 



intuitive in machine learning applications, where we might not want more than a small percentage of the 
binary variables on as part of a regularization or variable selection strategy. 



Ferromagnetic 2D grid Ising model: The ferromagnetic 2D grid Ising model is made up of nodes 
arranged in a planar and rectangular grid with connections between the nodes on one boundary to the 
nodes on the other boundary for each dimension (i.e. periodic boundaries), also known as a square 
toroidal grid. Hence, each node has exactly four neighbours. The interaction weights, Jij, are all 1 and 
the biases, bi, are all 0. The temperature 2.27 is the critical temperature of the model. 



Frustrated 3D cube Ising model: The frustrated 3D cube Ising model is made up of nodes arranged 
in a topology that is the three-dimensional analogue of the two-dimensional grid with periodic boundaries. 
Hence, each node has exactly six neighbours. The interaction weights, Jij, are uniformly sampled from 
the set {—1,1} and the biases, hi, are all 0. 



Restricted Boltzmann machine (RBM): The RBM has a bipartite graph structure, with h hidden 
nodes in one partition and v visible nodes in the other. The interaction weights Jij and biases bi are 
exactly the same as in Hamze & de Freitas (20101 and correspond to local Gabor-like filters that capture 
regularities in perceptual inputs. 

Each sampler was run five times with 9 x 10* steps for each run. IMBayesOpt had an additional 10* 
steps for an adaptation phase consisting of 100 adaptations of 100 samples each. IMBayesOpt was not 
penalized for the computational overhead involved in these additional steps because it is seen as being 
far cheaper than having the IM sampler parameters tuned manually. 




Figure 1: [Left] Mean and error bars of the auto-correlation function of the energies of the sampled states 
drawn from the 2D grid Ising model. [Right] Average of the energies. 
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Figure 2: [Left] The mean function of the Gaussian processes over 0, learned by IMBayesOpt for the 
2D grid Ising model. An average over the five trials is shown. [Right] The corresponding policy used for 
sampling. 



All of the algorithms have a burn-in phase consisting of the first 10^ samples generated in the corre- 
sponding sampling phase. The burn-in phase was not included in the computation of the auto-correlation 
functions in figures [TJ |3] and [5] 

IMBayesOpt begins its sampling phase in the same starting state as all of the other samplers, even 
though it would most likely be in a low energy state at the end of its adaptation phase. This ensures 
that the comparison of the sampling methods is fair. 



4.2 Results 

4.2.1 Ferromagnetic 2D grid Ising model 

IMExpert and IMBayesOpt both have very similar mean ACFs, as indicated in figure [T] IMUnif suffers 
from many long strings of consecutive proposal rejections. This is evident from the many intervals where 
the sampled state energy does not change. 



IMBayesOpt 

IMExperl 

IMUnif 

KawasaklSampler 




100 150 200 250 300 350 400 450 500 
Lag 





— IMExpert 

— IMUnif 

— KawasaklSampler 

— IMBayesOpt 











500 1000 1500 2000 2500 3000 3500 4000 
Steps 



Figure 3: [Left] Mean and error bars of the auto-correlation function of the energies of the sampled states 
drawn from the 3D cube Ising model. [Right] Average of the energies. 

Figure [2] suggests that 7 is much more important to the performance of the IM sampler than the 
SAW lengths for this model, especially at large SAW lengths. One of the highest probability points in the 
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Gaussian process corresponds to the parameters chosen by Hamze & de Freitas ( 2010[ ) (7 — 0.44, k — 90). 
Figure [2] also shows M — 1000 samples drawn from the Boltzmann policy. The MCMC results are for 
this policy. The same procedure is adopted for the other models. 



4.2.2 Frustrated 3D cube Ising model 

Figure [3] shows that IMUnif now performs far worse than the IMExpert and IMBayesOpt, implying that 
the extremely rugged energy landscape of the 3D cube Ising model makes manual tuning a non-trivial 
and necessary process. IMBayesOpt performs very similarly to IMExpert, but is automatically tuned 
without any human intervention. 





21.000 
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Figure 4: [Left] The mean function of the Gaussian processes over 0, learned by IMBayesOpt for the 
3D cube Ising model. An average over the five trials is shown. [Right] The corresponding policy used for 
sampling. 

The Gaussian process mean function in Fi gure [4] suggests that SAW lengths should not be longer 
than k = 25, as found in Hamze & de Freitas (2010). Both IMExpert and IMBayesOpt are essentially 



following the same strategy and performing well, while the performance of IMUnif confirms that tuning 
is important for the 3D cube Ising model. 



4.2.3 Restricted Boltzmann Machine (RBM) 



Hamze & de Freitas (2010) find that IMExpert experiences a rapid dropoff in AGE, but this dropoff is 
exaggerated by the inclusion of the burn-in phase. Figure [5] shows a much more modest dropoff when 
the burn- in phase is left out of the ACF computation. However, it still corroborates the claim in |Hamze| 
& de Freitas (2010) that IMExpert performs far better than the Kawasaki sampler. 

Figure[5]shows that IMExpert does not perform much better than IMUnif. The variance of IMExpert 's 
ACF is also much higher than any of the other methods. IMBayesOpt performs significantly better than 
any of the other methods, including manual tuning by a domain expert. 

The Gaussian process mean function in figure [B] suggests that SAW lengths greater than 20 can be 



used and are as effective as shorter ones, whereas Hamze & de Freitas (2010) only pick SAW lengths 
between 1 and 20. This discrepancy is an instance where Bayesian-optimized MCMC has found better 
parameters than a domain expert. 



5 Conclusions 

Experiments were conducted to assess Bayesian optimization for adaptive MCMC. |Hamze fc de Freitas] 
(20101 state that it is easiest to manually tune the 2D grid Ising model and that tuning the 3D cube Ising 
model is more challenging. Manual tuning significantly improves the performance of the IM sampler 
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Figure 5: [Left] Mean and error bars of the auto-correlation function of the energies of the sampled states 
drawn from the 3D cube Ising model. [Right] Average of the energies. 




Figure 6: [Left] The mean function of the Gaussian processes over 0, learned by IMBayesOpt for the 
RBM. An average over the five trials is shown. [Right] The corresponding policy used for sampling. 



for these cases, but Bayesian-optimized MCMC is able to realize the same gains without any human 
intervention. This automatic approach surpasses the human expert for the RBM model. 

Bayesian optimization is a general method for adapting the parameters of any MCMC algorithm. It 
has some advantages over stochastic approximation, as discussed in this paper. However, it presently 
only applies to parameter spaces of up to fifty dimensions. Bayesian optimization should not be seen 
as a replacement for stochastic approximation, but rather as a complementary technique. In particular, 
Bayesian optimization should be adopted when the objective is non-differentiable or expensive to evaluate. 
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